```r
# A tibble: 4 x 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
```
Suppose the variable hwy (fuel efficiency on highway) is very expensive to measure.
We decide to estimate it using the other variables. To do so, we will fit a regression model.
We can use our models to estimate hwy for new vehicles.
Imagine there is a new vehicle with \(\text{cty} = 30\). Instead of measuring its hwy (expensive), we use our model to estimate it. Using the “good” model gives the following estimate
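As a sketch of this step, assuming the "good" model is hwy ~ cty fitted on the mpg data (here stored in a data frame called `d`, a name taken from the code later in these slides), the estimate can be obtained with `predict`:

```r
library(ggplot2)  # provides the mpg dataset

d <- mpg
m <- lm(hwy ~ cty, data = d)

# Estimate hwy for a new vehicle with cty = 30 -- no expensive measurement needed
predict(m, newdata = data.frame(cty = 30))
```

The `newdata` argument lets us plug in values of cty that never appeared in the original data.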
We want to choose estimates that give a model that fits the data well.
a model with a regression line that is close to the data
We want to minimize the residuals.
Minimizing residuals
Perhaps the most natural thing to do is to find the values for \(\beta_0\) and \(\beta_1\) that minimize the sum of absolute residuals: \[
|e_1|+|e_2|+\dots+|e_n|
\]
For practical reasons, the sum of squared residuals (SSR) is a more common criterion \[
e_1^2+e_2^2+\dots+e_n^2
\]
Why squaring the residuals?
can work by hand (pre-computer era)
reflects the assumption that being off by \(4\) is more than twice as bad as being off by \(2\)
nice mathematical properties
mainstream
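A quick numeric check of the penalty claim above, in base R: under squared error, an error of 4 costs four times as much as an error of 2, while under absolute error it costs only twice as much.

```r
errors <- c(2, 4)

abs_penalties <- abs(errors)  # 2 and 4
sq_penalties  <- errors^2     # 4 and 16

sq_penalties[2] / sq_penalties[1]    # 4: squaring penalizes large errors more
abs_penalties[2] / abs_penalties[1]  # 2: absolute error scales linearly
```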
Least-square estimates
We find the values for \(\beta_0\) and \(\beta_1\) that minimize the SSR with the R command lm
Note that the slope coefficient is negative, which makes sense: cars with larger engines tend to be less efficient.
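A minimal sketch of this fit, assuming `d` is the mpg dataset from ggplot2 (as in the code later in these slides):

```r
library(ggplot2)  # provides the mpg dataset
d <- mpg

# lm() finds the beta_0 and beta_1 that minimize the sum of squared residuals
m_displ <- lm(hwy ~ displ, data = d)
coef(m_displ)  # the slope on displ is negative
```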
We now have two models. Which is the best?
We could start by looking at the residuals.
Comparing residuals
```r
library(ggplot2)  # mpg data and plotting
library(broom)    # augment()
library(dplyr)    # the %>% pipe

m <- lm(hwy ~ cty, data = d)
m_augment <- augment(m)
ggplot(m_augment) + geom_histogram(aes(.resid))

lm(hwy ~ displ, data = d) %>%
  augment() %>%
  ggplot() + geom_histogram(aes(.resid))
```
The first model seems to have smaller residuals.
\(\Rightarrow\) choose the first model (cty)!
But looking at a plot can be misleading
illusions
difficult to compare models with similar residuals
We need a more systematic approach for comparing models.
SSR
Instead of comparing histograms of residuals, we can compute the SSR (sum of squared residuals).
\[
SSR = e_1^2+e_2^2+\dots+e_n^2
\]
small residuals will give a small SSR
large residuals will give a large SSR
\(\Rightarrow\) choose the model with the smaller SSR!
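This comparison can be sketched as follows, again assuming `d` is the mpg dataset (the helper `ssr` is introduced here for illustration, not part of the slides):

```r
library(ggplot2)  # provides the mpg dataset
d <- mpg

# Sum of squared residuals for a fitted model
ssr <- function(model) sum(residuals(model)^2)

ssr_cty   <- ssr(lm(hwy ~ cty,   data = d))
ssr_displ <- ssr(lm(hwy ~ displ, data = d))

c(cty = ssr_cty, displ = ssr_displ)  # the cty model has the smaller SSR
```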
📋 The textbook uses the term SSE (sum of squared errors).
\(R^2\)
While the SSR can be used to select a model, it is also useful in describing the goodness of fit of the model.
The SST (total sum of squares) is the sum of squared distances to the mean \(\bar{x}\): \[
SST = (x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2
\] It measures the total amount of variability in the data.
Remember the formula for the SSR: \[
SSR = (x_1 - \hat{x}_1)^2 + (x_2 - \hat{x}_2)^2 + \dots + (x_n - \hat{x}_n)^2
\] It measures the amount of variability in the data left unexplained by the model.
\(SST - SSR\) is therefore the amount of variation explained by the model: \[
\text{data} = SST = (SST-SSR) + SSR = \text{model} + \text{residuals}
\]
The statistic \(R^2\) measures the proportion of variation in the data that is explained by the model.
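Concretely, \(R^2 = (SST - SSR)/SST = 1 - SSR/SST\). A sketch of the computation, assuming `d` is the mpg dataset and the model is hwy ~ cty:

```r
library(ggplot2)  # provides the mpg dataset
d <- mpg

m   <- lm(hwy ~ cty, data = d)
sst <- sum((d$hwy - mean(d$hwy))^2)  # total variability
ssr <- sum(residuals(m)^2)           # variability left unexplained

r2 <- (sst - ssr) / sst  # equivalently 1 - ssr / sst
r2                       # agrees with summary(m)$r.squared
```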